Project Overview

Industrial Safety and Health: NLP-based Chatbot.

Data Exploration/Cleansing

Description

Data Pre-processing

Observations:

Exploratory Data Analysis (EDA)

Let's analyze the data using univariate and bivariate analysis:

Country

Accident Level

Gender

Industry Sector

Critical_Risk

Note:

Univariate analysis

Local

Local_03 has the highest number of accidents.

Accident_Level

Level I has the most accidents, which are minor; Level V accidents are severe.

Potential_Accident_Level

Potential accident level indicates how severe the accident is; Level IV has the highest count.
Level VI has only one incident, so it can be replaced with Level V.
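
Folding the single Level VI record into Level V could look roughly like the sketch below. It uses plain Python lists rather than the notebook's own data structures; the sample values and the helper name `merge_rare_level` are hypothetical and only illustrate the idea.

```python
from collections import Counter


def merge_rare_level(levels, rare="VI", target="V"):
    """Return a copy of `levels` with every `rare` label replaced by `target`."""
    return [target if level == rare else level for level in levels]


# Hypothetical sample of Potential_Accident_Level values (not the real data).
sample = ["I", "IV", "IV", "III", "VI", "V", "II", "IV"]
merged = merge_rare_level(sample)
counts = Counter(merged)
print(counts)
```

After the merge the target has one class fewer, which avoids a class with a single example that could not be split between training and test sets.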

Employee_Type

Employee type indicates which type of employee is affected more. Employees and third-party employees are observed to be affected more.

Bivariate analysis Observations:

'Potential accident level' by 'Country'

Country_01 has the highest number of severe accidents at Level IV.
Country_02 has a moderate number of accidents across all levels.
Country_03 has more Level I accidents than Country_01 and Country_02, but fewer severe accidents.

'Local' by 'Potential accident level'

Local_03 is where most of the accidents happen

'Potential Accident level counts' by 'Employee_Type'

Third-party employees are more affected by accidents. Severe accidents at Level IV are also observed.

'Industry_sector' by 'Local'

The industry sector depends on the local area.
Local areas 1, 2, 3, 4 and 7 belong to the Mining sector.
Local areas 5, 6, 8 and 9 belong to the Metal sector.
Local areas 10, 11 and 12 belong to Other sectors.
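
The relationship between local area and industry sector listed above can be written down as a simple lookup table. This is a minimal sketch: the `Local_NN` label format follows the `Local_03` spelling used in these notes, while the sector label strings and the helper `sector_for` are assumptions made for illustration.

```python
# Mapping taken from the observations above; label spellings are assumptions.
LOCALS_BY_SECTOR = {
    "Mining": (1, 2, 3, 4, 7),
    "Metal": (5, 6, 8, 9),
    "Others": (10, 11, 12),
}

# Invert the table so that each local label points to its sector.
SECTOR_BY_LOCAL = {
    f"Local_{number:02d}": sector
    for sector, numbers in LOCALS_BY_SECTOR.items()
    for number in numbers
}


def sector_for(local):
    """Look up the sector for a local label, or 'Unknown' if it is not listed."""
    return SECTOR_BY_LOCAL.get(local, "Unknown")


print(sector_for("Local_03"))
```

Because each local maps to exactly one sector, the two columns carry overlapping information, which may be worth keeping in mind when both are used as model features.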

'Potential Accident level' by 'Gender'

Males are more involved in severe accidents.

Accident Level by Country with Critical Risk

NLP Analysis/Pre-processing

Text Processing

N-gram, uni-gram, Bi-gram
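
As an illustration of what uni-grams and bi-grams are, the sketch below extracts n-grams from a tokenized description in plain Python. The sample sentence and the function name `ngrams` are hypothetical; the notebook may rely on a library helper instead.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical cleaned description, already lower-cased and tokenized.
tokens = "employee injured left hand while removing left hand glove".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Most frequent bi-grams, as one would plot them in a bar chart.
top_bigrams = Counter(bigrams).most_common(3)
print(top_bigrams)
```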

Plot the Wordcloud for Clean_Description

Feature Engineering

Vectorization

TF-IDF
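
To make the TF-IDF weighting concrete, here is a from-scratch sketch using the textbook definition tf(t, d) * log(N / df(t)). Libraries such as scikit-learn use a smoothed variant, so their numbers would differ from these. The toy documents and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical tokenized descriptions.
docs = [["hand", "injury"], ["hand", "burn"]]
weights = tf_idf(docs)
print(weights)
```

A term that occurs in every document (here `hand`) gets weight zero under this definition, which is the intended effect: it does not help distinguish one description from another.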

Model

Results comparison

Convert the categorical data using Label Encoding
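
Label encoding replaces each category with an integer code. The sketch below shows the idea in plain Python, assigning codes in sorted order of the class names; the sample column and the helper names are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}


def encode(values, mapping):
    """Replace every label with its integer code."""
    return [mapping[value] for value in values]


# Hypothetical Gender column.
gender = ["Male", "Female", "Male", "Male"]
mapping = fit_label_encoder(gender)
encoded = encode(gender, mapping)
print(mapping, encoded)
```

The integer codes imply an ordering, which suits ordinal columns such as the accident level better than nominal ones such as country or gender.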

Save & re-run the file

Model Train-Test split
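
A shuffled train-test split can be sketched as follows. The 80/20 ratio, the fixed seed, and the helper name `simple_split` are assumptions; the notebook may use a library function and a different ratio.

```python
import random


def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out `test_size` of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Hypothetical rows standing in for the feature matrix.
rows = list(range(10))
train, test = simple_split(rows)
print(len(train), len(test))
```

With imbalanced accident levels, a stratified split that preserves class proportions in both parts is usually worth considering.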

Model Building with ML algorithms

Creating a function for different ML models
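
One way to structure such a function is to accept any object with `fit` and `predict` methods and return a row for the comparison table. The sketch below is illustrative only: `evaluate_model` and the `MajorityClassifier` stand-in are hypothetical names, and the tiny data set is made up.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    """Fit `model`, predict on the test set, and return a result row."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return {"model": name, "accuracy": correct / len(y_test)}


# Made-up data: three training rows, two test rows.
result = evaluate_model(
    "majority-baseline", MajorityClassifier(),
    [[0], [1], [2]], ["I", "I", "II"],
    [[3], [4]], ["I", "II"],
)
print(result)
```

Any estimator exposing the same `fit`/`predict` interface could be passed in place of the stand-in, so the same function can be called once per algorithm listed below.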

Multinomial Naive Bayes

KNN

SVC-Linear

SVC-rbf

Decision-Tree

Random Forest

Bagging classifier

ExtraTreesClassifier

Adaboost Classifier

GBM

XGBoost

LGBM

Observation:

ML_Model comparison

Algorithm Performance Comparison in Tabular form

Hyperparameter tuning

Sampling/SMOTE techniques for Imbalanced data
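
SMOTE creates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that core step in plain Python; it is a simplified illustration, not the implementation used in the notebook, and the function name `smote_like`, the parameter defaults, and the toy points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points from `minority` (needs at least 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Oversampling is normally applied to the training split only, so that the test set still reflects the original class distribution.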

Model Building on SMOTE data

Multinomial NB

Adaboost

Xgboost

Model comparison

Sampling techniques: Oversampling & Undersampling

Model Building on SMOTE data

Multinomial NB

Model building with Deep learning Algorithms

Model building using NN classifiers (Target)

Model building with SMOTE data

Model building with RNN & LSTM classifier

Create model with text data with word embeddings

Data Preparation

Sequence Data Transformation:

Padding
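
Padding brings every token-id sequence to the same length so that they can be batched for an RNN or LSTM. The sketch below pads at the front with zeros and truncates from the front, mirroring the default behaviour of Keras `pad_sequences`; the function name `pad_sequence`, the chosen `maxlen`, and the sample ids are assumptions.

```python
def pad_sequence(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) `sequence` to exactly `maxlen` items; maxlen > 0."""
    trimmed = sequence[-maxlen:]  # keep the last `maxlen` tokens
    return [value] * (maxlen - len(trimmed)) + trimmed


# Hypothetical token-id sequences of different lengths.
sequences = [[4, 7, 2], [9, 1, 5, 3, 8, 6]]
padded = [pad_sequence(seq, maxlen=4) for seq in sequences]
print(padded)
```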

In this section we will use GloVe to convert text into numeric vector representations.
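
The usual pattern is to parse the GloVe text file (one token per line followed by its vector components) and build an embedding matrix indexed by the tokenizer's word index. The sketch below uses two made-up 2-dimensional vectors in place of a real GloVe file; the function names and the toy word index are hypothetical.

```python
def load_glove(lines):
    """Parse GloVe-format lines ('word v1 v2 ...') into a word -> vector dict."""
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, index in word_index.items():
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


# Made-up vectors standing in for a real GloVe file.
toy_glove = ["hand 0.1 0.2", "burn 0.3 0.4"]
word_index = {"hand": 1, "injury": 2}  # hypothetical tokenizer output

vectors = load_glove(toy_glove)
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(embedding_matrix)
```

Words missing from the pre-trained vocabulary keep an all-zero row here; initializing them randomly instead is another common choice.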

NN algorithm

From the above graph we can say the model is not overfitted and is a good fit.

NN with BoW

Word2Vec

CBOW

NN for CBOW

Creating a model with categorical data

ML models

NN for Categorical data

Deployment

Deployed on Heroku using the Streamlit framework

Screenshots of Model Predictions using chat interface:

https://accident-level-predictor.herokuapp.com/

Conclusion